{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "ee6cd357",
      "metadata": {
        "id": "ee6cd357"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/yt-search/00-data-build.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/yt-search/00-data-build.ipynb)\n",
        "\n",
        "# NLP Powered Youtube Search\n",
        "\n",
        "# Preparing Dataset\n",
        "\n",
        "We will be downloading the initial dataset from Kaggle, for this we need to install the kaggle client via `pip install kaggle`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "247dd055",
      "metadata": {
        "id": "247dd055"
      },
      "outputs": [],
      "source": [
        "!pip install -qU kaggle"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "b4a6864c",
      "metadata": {
        "id": "b4a6864c"
      },
      "outputs": [],
      "source": [
        "import kaggle"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f3e3f7e0",
      "metadata": {
        "id": "f3e3f7e0"
      },
      "source": [
        "When first trying to `import kaggle` we will see an error showing us where we need to place a Kaggle API key, we can find our API key by signing up for an account on Kaggle and clicking on our **profile in the top-right > Account > API > Create API token**. This will download a *kaggle.json*, which we must place in the directory mentioned above.\n",
        "\n",
        "*(The dataset is ~10GB in size so feel free to skip this step and download the modified dataset directly from HuggingFace datasets)*\n",
        "\n",
        "Once we have our *kaggle.json* in the correct directory we download the YTTTS speech collection dataset like so:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8eaaf066",
      "metadata": {
        "id": "8eaaf066"
      },
      "outputs": [],
      "source": [
        "!kaggle datasets download ryanrudes/yttts-speech"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1621910e",
      "metadata": {
        "id": "1621910e"
      },
      "source": [
        "We can unzip the dataset files:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8c92bc42",
      "metadata": {
        "id": "8c92bc42"
      },
      "outputs": [],
      "source": [
        "!unzip yttts-speech.zip"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "140782e4",
      "metadata": {
        "id": "140782e4"
      },
      "source": [
        "To build the full dataset we need to work through a few additional steps and install a few more libraries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5rF8KOj7ZsF0",
      "metadata": {
        "id": "5rF8KOj7ZsF0"
      },
      "outputs": [],
      "source": [
        "!pip install -qU bs4 tqdm datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2eb80adb",
      "metadata": {
        "id": "2eb80adb"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import time\n",
        "import requests\n",
        "from tqdm.auto import tqdm\n",
        "from bs4 import BeautifulSoup"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f8e42f19",
      "metadata": {
        "id": "f8e42f19"
      },
      "source": [
        "The current dataset is organized into a set of directories containing folders named based on video IDs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "67c67b96",
      "metadata": {
        "id": "67c67b96",
        "outputId": "e768ab78-d1bc-496a-e8ae-975f5d1935d2"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['ZPewmEu7644', 'g4M7stjzR1I', 'P0yVuoATjzs', 'EkzZSaeIikI', 'pWAc9B2zJS4']"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "video_ids = os.listdir(\"data\")\n",
        "video_ids[:5]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7815d479",
      "metadata": {
        "id": "7815d479"
      },
      "source": [
        "Inside each of these we find many more directories where each represents a timestamp pulled from the video."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "97c3c082",
      "metadata": {
        "id": "97c3c082",
        "outputId": "ed3419e8-c32f-4fc3-bb5b-9bf214da2799"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['00:00:00,030-00:00:02,040',\n",
              " '00:00:02,040-00:00:03,570',\n",
              " '00:00:03,570-00:00:05,670',\n",
              " '00:00:05,670-00:00:07,230',\n",
              " '00:00:07,230-00:00:09,120']"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "splits = sorted(os.listdir(f\"data/{video_ids[0]}\"))\n",
        "splits[:5]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "98df4e84",
      "metadata": {
        "id": "98df4e84"
      },
      "source": [
        "In here we have the text transcription itself."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "99ba9382",
      "metadata": {
        "id": "99ba9382",
        "outputId": "2f166c7d-b517-45f3-97b1-4f77011f21e3"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "hi this is Jeff Dean welcome to\n"
          ]
        }
      ],
      "source": [
        "with open(f\"data/{video_ids[0]}/{splits[0]}/subtitles.txt\") as f:\n",
        "    text = f.read()\n",
        "print(text)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "29d9ab6e",
      "metadata": {
        "id": "29d9ab6e"
      },
      "source": [
        "We will loop through all of these files to give us the initial core dataset consisting of *video_id*, *text*, *start_second*, *end_second*, and *url*."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d70e2a72",
      "metadata": {
        "id": "d70e2a72",
        "outputId": "f49e13bc-5d4b-45d2-a37e-0f36a81978a8"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "100%|██████████| 127/127 [00:19<00:00,  6.60it/s]\n"
          ]
        }
      ],
      "source": [
        "documents = []\n",
        "for video_id in tqdm(video_ids):\n",
        "    splits = sorted(os.listdir(f\"data/{video_id}\"))\n",
        "    # we start at 00:00:00\n",
        "    start_timestamp = \"00:00:00\"\n",
        "    passage = \"\"\n",
        "    for i, s in enumerate(splits):\n",
        "        with open(f\"data/{video_id}/{s}/subtitles.txt\") as f:\n",
        "            # append tect to current chunk\n",
        "            out = f.read()\n",
        "            passage += \" \" + out\n",
        "        # average sentence length is 75-100 characters so we will cut off\n",
        "        # around 3-4 sentences\n",
        "        if len(passage) > 360:\n",
        "            # now we've hit the needed length create a record\n",
        "            # extract the end timestamp from the filename\n",
        "            end_timestamp = s.split(\"-\")[1].split(\",\")[0]\n",
        "            # extract string timestamps to actual datetime objects\n",
        "            start = time.strptime(start_timestamp,\"%H:%M:%S\")\n",
        "            end = time.strptime(end_timestamp,\"%H:%M:%S\")\n",
        "            # now we extract the second/minute/hour values and convert\n",
        "            # to total number of seconds\n",
        "            start_second = start.tm_sec + start.tm_min*60 + start.tm_hour*3600\n",
        "            end_second = end.tm_sec + end.tm_min*60 + end.tm_hour*3600\n",
        "            # save this to the documents list\n",
        "            documents.append({\n",
        "                \"video_id\": video_id,\n",
        "                \"text\": passage,\n",
        "                \"start_second\": start_second,\n",
        "                \"end_second\": end_second,\n",
        "                \"url\": f\"https://www.youtube.com/watch?v={video_id}&t={start_second}s\",\n",
        "            })\n",
        "            # now we update the start_timestamp for the next chunk\n",
        "            start_timestamp = end_timestamp\n",
        "            # refresh passage\n",
        "            passage = \"\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1669785d",
      "metadata": {
        "id": "1669785d",
        "outputId": "9f60169b-8b9c-4062-849e-bcd5e7d1db0d"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'video_id': 'ZPewmEu7644',\n",
              "  'text': \" hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you\",\n",
              "  'start_second': 0,\n",
              "  'end_second': 20,\n",
              "  'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s'},\n",
              " {'video_id': 'ZPewmEu7644',\n",
              "  'text': ' often see them use for they can definitely generate other types of images but they can also work on tabular data and really any sort of data where you are attempting to have a neural network that is generating data that should be real or should or could be classified as fake the key element to having something as again is having that discriminator that tells the difference',\n",
              "  'start_second': 20,\n",
              "  'end_second': 41,\n",
              "  'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=20s'},\n",
              " {'video_id': 'ZPewmEu7644',\n",
              "  'text': \" in the generator that actually generates the data another area that we are seeing ganz use for a great deal is in the area of semi supervised training so let's first talk about what semi-supervised training actually is and see how again can be used to implement this first let's talk about supervised training and unsupervised training which you've probably seen in previous machine\",\n",
              "  'start_second': 41,\n",
              "  'end_second': 64,\n",
              "  'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=41s'}]"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "documents[:3]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0d7b57cf",
      "metadata": {
        "id": "0d7b57cf"
      },
      "source": [
        "We also need additional video metadata that cannot be pulled from our dataset, like the video *title* and *thumbnail*. For both of these we can scrape the data using Beautiful Soup."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "eed85411",
      "metadata": {
        "id": "eed85411",
        "outputId": "f6e79a97-555d-433c-b6dc-184ddae060fd"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            " 51%|█████     | 65/127 [02:56<02:01,  1.96s/it]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "'NoneType' object has no attribute 'get'\n",
            "fpDaQxG5w4o\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            " 52%|█████▏    | 66/127 [03:00<02:42,  2.67s/it]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "'NoneType' object has no attribute 'get'\n",
            "arbbhHyRP90\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "100%|██████████| 127/127 [05:21<00:00,  2.54s/it]\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "127"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import lxml  # if on mac, pip/conda install lxml\n",
        "\n",
        "metadata = {}\n",
        "for _id in tqdm(video_ids):\n",
        "    r = requests.get(f\"https://www.youtube.com/watch?v={_id}\")\n",
        "    soup = BeautifulSoup(r.content, 'lxml')  # lxml package is used here\n",
        "    try:\n",
        "        title = soup.find(\"meta\", property=\"og:title\").get(\"content\")\n",
        "        thumbnail = soup.find(\"meta\", property=\"og:image\").get(\"content\")\n",
        "        metadata[_id] = {\"title\": title, \"thumbnail\": thumbnail}\n",
        "    except Exception as e:\n",
        "        print(e)\n",
        "        print(_id)\n",
        "        metadata[_id] = {\"title\": \"\", \"thumbnail\": \"\"}\n",
        "\n",
        "len(metadata)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fa95e454",
      "metadata": {
        "id": "fa95e454",
        "outputId": "163f0489-3843-4b3d-b21a-2fd84ecc2915"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'video_id': 'ZPewmEu7644',\n",
              " 'text': \" hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you\",\n",
              " 'start_second': 0,\n",
              " 'end_second': 20,\n",
              " 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s'}"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "documents[0]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4d41bda8",
      "metadata": {
        "id": "4d41bda8",
        "outputId": "d5c173cf-192b-4a53-8ae2-4b3569209030"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',\n",
              " 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "metadata['ZPewmEu7644']"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f7d121f0",
      "metadata": {
        "id": "f7d121f0"
      },
      "outputs": [],
      "source": [
        "for i, doc in enumerate(documents):\n",
        "    _id = doc['video_id']\n",
        "    meta = metadata[_id]\n",
        "    # add metadata to existing doc\n",
        "    documents[i] = {**doc, **meta}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e52402d8",
      "metadata": {
        "id": "e52402d8",
        "outputId": "ba7f44ad-4bc8-464c-c6b4-3f8ede39c188"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'video_id': 'ZPewmEu7644',\n",
              " 'text': \" hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you\",\n",
              " 'start_second': 0,\n",
              " 'end_second': 20,\n",
              " 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s',\n",
              " 'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',\n",
              " 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "documents[0]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "40f291e9",
      "metadata": {
        "id": "40f291e9"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "with open(\"train.jsonl\", \"w\") as f:\n",
        "    for doc in documents:\n",
        "        json.dump(doc, f)\n",
        "        f.write('\\n')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0cf4c356",
      "metadata": {
        "id": "0cf4c356"
      },
      "outputs": [],
      "source": [
        "with open(\"train.jsonl\") as f:\n",
        "    d = f.readlines()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c250d904",
      "metadata": {
        "id": "c250d904",
        "outputId": "3d759a88-e2a1-4673-868b-c2d78802085e"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['{\"video_id\": \"ZPewmEu7644\", \"text\": \" hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we\\'re going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan\\'s have a wide array of uses beyond just the face generation that you\", \"start_second\": 0, \"end_second\": 20, \"url\": \"https://www.youtube.com/watch?v=ZPewmEu7644&t=0s\", \"title\": \"GANS for Semi-Supervised Learning in Keras (7.4)\", \"thumbnail\": \"https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg\"}\\n',\n",
              " '{\"video_id\": \"ZPewmEu7644\", \"text\": \" often see them use for they can definitely generate other types of images but they can also work on tabular data and really any sort of data where you are attempting to have a neural network that is generating data that should be real or should or could be classified as fake the key element to having something as again is having that discriminator that tells the difference\", \"start_second\": 20, \"end_second\": 41, \"url\": \"https://www.youtube.com/watch?v=ZPewmEu7644&t=20s\", \"title\": \"GANS for Semi-Supervised Learning in Keras (7.4)\", \"thumbnail\": \"https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg\"}\\n',\n",
              " '{\"video_id\": \"ZPewmEu7644\", \"text\": \" in the generator that actually generates the data another area that we are seeing ganz use for a great deal is in the area of semi supervised training so let\\'s first talk about what semi-supervised training actually is and see how again can be used to implement this first let\\'s talk about supervised training and unsupervised training which you\\'ve probably seen in previous machine\", \"start_second\": 41, \"end_second\": 64, \"url\": \"https://www.youtube.com/watch?v=ZPewmEu7644&t=41s\", \"title\": \"GANS for Semi-Supervised Learning in Keras (7.4)\", \"thumbnail\": \"https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg\"}\\n']"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "d[:3]"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "00_data_build.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.7 (main, Sep 14 2022, 22:38:23) [Clang 14.0.0 (clang-1400.0.29.102)]"
    },
    "vscode": {
      "interpreter": {
        "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
